Abstract. In the Bayesian approach to sequential decision making, exact
calculation of the (subjective) utility is intractable. This extends to
most special cases of interest, such as reinforcement learning problems.
While utility bounds are known to exist for this problem, so far none of
them were particularly tight. In this paper, we show how to efficiently
calculate a lower bound, which corresponds to the utility of a near-optimal
memoryless policy for the decision problem. This policy is generally different
from both the Bayes-optimal policy and the policy which is optimal for
the expected MDP under the current belief. We then show how this bound
can be applied to obtain robust exploration policies in a Bayesian
reinforcement learning setting.



1 Setting 

We consider decision making problems where an agent is acting in an environment
that is possibly unknown to it. By choosing actions, the agent changes the state
of the environment and in addition obtains scalar rewards. The agent acts so
as to maximise the expectation of the utility function $U_t \triangleq \sum_{k=t}^{T} \gamma^{k-t} r_k$, where
$\gamma \in [0, 1]$ is a discount factor and where the instantaneous rewards $r_t \in [0, r_{\max}]$
are drawn from a Markov decision process (MDP) $\mu$, defined on a state space
$\mathcal{S}$ and an action space $\mathcal{A}$, both equipped with a suitable metric and $\sigma$-algebra,
with a set of transition probability measures $\{\mathcal{T}_\mu^{s,a} \mid s \in \mathcal{S}, a \in \mathcal{A}\}$ on $\mathcal{S}$, and a
set of reward probability measures $\{\mathcal{R}_\mu^{s,a} \mid s \in \mathcal{S}, a \in \mathcal{A}\}$ on $\mathbb{R}$, such that:

$$ r_t \mid s_t = s, a_t = a \sim \mathcal{R}_\mu^{s,a}, \qquad s_{t+1} \mid s_t = s, a_t = a \sim \mathcal{T}_\mu^{s,a}, \qquad (1.1) $$

where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ are the state of the MDP and the action taken by
the agent at time $t$, respectively. The environment is controlled via a policy
$\pi \in \mathcal{P}$. This defines a conditional probability measure on the set of actions,
such that $\mathbb{P}_\pi(a_t \in A \mid s^t, a^{t-1}) = \pi(A \mid s^t, a^{t-1})$ is the probability of the
action taken at time $t$ being in $A$, where we use $\mathbb{P}$, with appropriate subscripts,
to denote probabilities of events, and $s^t \triangleq s_1, \ldots, s_t$ and $a^{t-1} \triangleq a_1, \ldots, a_{t-1}$
denote sequences of states and actions respectively. We use $\mathcal{P}_k$ to denote the
set of $k$-order Markov policies. Important special cases are the set of blind policies
$\mathcal{P}_0$ and the set of memoryless policies $\mathcal{P}_1$. A policy $\pi \in \bar{\mathcal{P}}_k \subset \mathcal{P}_k$ is stationary
when $\pi(A \mid s^t_{t-k+1}, a^{t-1}_{t-k+1}) = \pi(A \mid s^k, a^{k-1})$ for all $t$.
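
As a concrete reading of the utility defined above, the following minimal Python sketch computes $U_t$ from a finite reward sequence. It assumes that the discount is applied relative to stage $t$ and that stages are indexed from zero; the function name `discounted_utility` is hypothetical and not part of the paper.

```python
def discounted_utility(rewards, gamma, t=0):
    """Return U_t = sum_{k >= t} gamma**(k - t) * rewards[k].

    Assumption: `rewards` is a finite list indexed from 0, so the sum
    stops at the last observed reward.
    """
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)


# The utility satisfies the one-step recursion U_t = r_t + gamma * U_{t+1}.
rewards = [1.0, 1.0, 1.0]
u0 = discounted_utility(rewards, 0.5, t=0)
u1 = discounted_utility(rewards, 0.5, t=1)
```

The recursion noted in the comment is the property that backwards induction relies on.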

The expected utility, conditioned on the policy, states and actions, is used to
define a value function for the MDP $\mu$ and a stationary policy $\pi$, at stage $t$:

$$ Q^\pi_{\mu,t}(s,a) \triangleq E_{\mu,\pi}(U_t \mid s_t = s, a_t = a), \qquad V^\pi_{\mu,t}(s) \triangleq E_{\mu,\pi}(U_t \mid s_t = s), \qquad (1.2) $$

where the expectation is taken with respect to the process defined jointly by
$\mu, \pi$ on the set of all state-action-reward sequences $(\mathcal{S}, \mathcal{A}, \mathbb{R})^*$. The optimal value
function is denoted by $Q^*_{\mu,t} \triangleq \sup_\pi Q^\pi_{\mu,t}$ and $V^*_{\mu,t} \triangleq \sup_\pi V^\pi_{\mu,t}$. We denote the
optimal policy¹ for $\mu$ by $\pi^*_\mu$. Then $Q^*_{\mu,t} = Q^{\pi^*_\mu}_{\mu,t}$ and $V^*_{\mu,t} = V^{\pi^*_\mu}_{\mu,t}$.

There are two ways to handle the case when the true MDP is unknown. 
The first is to consider a set of MDPs such that the probability of the true 
MDP lying outside this set is bounded from above [e.g. S3, EH, S, H 0, Hi | . 



The second is to use a Bayesian framework, whereby a full distribution over 
possible MDPs is maintained, representing our subjective belief, such that MDPs 
which we consider more likely have higher probability [e.g. EH, E3, HH, 0, Ell- 
Hybrid approaches are relatively rare [16]. In this paper, we derive a method for
efficiently calculating near-optimal, robust, policies in a Bayesian setting. 



1.1 Bayes-optimal policies 

In the Bayesian setting, our uncertainty about the Markov decision process
(MDP) is formalised as a probability distribution on the class of allowed MDPs.
More precisely, assume a probability measure $\xi$ over a set of possible MDPs $\mathcal{M}$,
representing our belief. The expected utility of a policy $\pi$ with respect to the
belief $\xi$ is:

$$ E_{\xi,\pi} U_t = \int_{\mathcal{M}} E_{\mu,\pi}(U_t) \, d\xi(\mu). \qquad (1.3) $$

Without loss of generality, we may assume that all MDPs in $\mathcal{M}$ share the same
state and action space. For compactness, and with minor abuse of notation, we
define the following value functions with respect to the belief:

$$ Q^\pi_{\xi,t}(s,a) \triangleq E_{\xi,\pi}(U_t \mid s_t = s, a_t = a), \qquad V^\pi_{\xi,t}(s) \triangleq E_{\xi,\pi}(U_t \mid s_t = s), \qquad (1.4) $$

which represent the expected utility under the belief $\xi$, at stage $t$, of policy $\pi$,
conditioned on the current state and action.
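
For a finite set of MDPs the integral in (1.3) becomes a weighted sum, which the following sketch illustrates for a fixed memoryless policy evaluated at the initial stage under the prior belief. The nested-list MDP representation (`T[s][a][s2]` for transition probabilities, `R[s][a]` for expected rewards) and the function names are assumptions made for this illustration only.

```python
def policy_value(T, R, policy, gamma):
    """Finite-horizon value of a memoryless policy in a single MDP.

    T[s][a][s2] is a transition probability, R[s][a] an expected reward,
    and policy[t][s] the action chosen at stage t in state s.
    """
    n_states = len(T)
    V = [0.0] * n_states  # value after the final stage
    for stage in reversed(policy):
        V = [
            R[s][stage[s]]
            + gamma * sum(p * V[s2] for s2, p in enumerate(T[s][stage[s]]))
            for s in range(n_states)
        ]
    return V


def belief_value(mdps, belief, policy, gamma):
    """Belief-weighted value: sum over MDPs of xi(mu) * V_mu^pi(s)."""
    values = [policy_value(T, R, policy, gamma) for T, R in mdps]
    return [
        sum(w * v[s] for w, v in zip(belief, values))
        for s in range(len(values[0]))
    ]


# Two single-state, single-action MDPs with expected rewards 1 and 0.
mdp_a = ([[[1.0]]], [[1.0]])
mdp_b = ([[[1.0]]], [[0.0]])
two_stage_policy = [[0], [0]]
v_xi = belief_value([mdp_a, mdp_b], [0.5, 0.5], two_stage_policy, 0.5)
```

Any such value is a lower bound on the Bayes-optimal value, because the policy is fixed in advance and does not adapt to observations.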

Definition 1 (Bayes-optimal policy). A Bayes-optimal policy $\pi^*_\xi$ with respect
to a belief $\xi$ is a policy maximising (1.3). Similarly to the known MDP
case, we use $Q^*_{\xi,t}$, $V^*_{\xi,t}$ to denote the value functions of the Bayes-optimal policy.

Finding the Bayes-optimal policy is generally intractable [ll|, EH, EH • It is im- 
portant to note that a Bayes-optimal policy is not necessarily the same as the 
optimal policy for the true MDP. Rather, it is the optimal policy given that the 
true MDP was drawn at the start of the experiment from the distribution $\xi$. All
the theoretical development in this paper is with respect to $\xi$.

1 We assume that there exists at least one optimal policy. If there are multiple optimal 
policies, we choose arbitrarily among them. 

1.2 Related work and main contribution

Since computation of the Bayes-optimal policy is intractable in the general case,
in this work we provide a simple algorithm for finding near-optimal memoryless
policies in polynomial time. By definition, for any belief $\xi$, the expected utility
under that belief of any policy $\pi$ is a lower bound on that of the optimal policy
$\pi^*_\xi$. Consequently, the near-optimal memoryless policy gives us a tight lower
bound on the subjective utility.

A similar idea was used in [12], where the stationary policy that is optimal
on the expected MDP is used to obtain a lower bound. This is later refined
through a stochastic branch-and-bound technique that employs a similar upper
bound. In a similar vein, [17] uses approximate Bayesian inference to obtain
a stationary policy for the current belief. More specifically, they consider two
families of expectation maximisation algorithms. The first uses a variational
approximation to the reward-weighted posterior of the transition distribution,
while the second performs expectation propagation on the first two moments.
However, none of the above approaches return the optimal stationary policy.

It is worthwhile to mention the very interesting point-based Beetle algorithm
of Poupart et al. [23], which discretised the belief space by sampling a
set of future beliefs (rather than MDPs). Using the convexity of the utility with
respect to the belief, they constructed a lower bound via a piecewise-linear
approximation of the complete utility from these samples. The approach results
in an approximation to the optimal non-stationary policy. Although the algorithm
is based on an optimal construction reported in the same paper, sufficient
conditions for its optimality are not known.

In this paper, we obtain a tight lower bound for the current belief by
calculating a nearly optimal memoryless policy. The procedure is computationally
efficient, and we show that it results in a much tighter bound than the value of 
the expected-MDP-optimal policy. We also show that it can be used in practice 
to perform robust Bayesian exploration in unknown MDPs. This is achieved by 
computing a new memoryless policy once our belief has changed significantly, a 
technique also employed by other approaches H, [H, Ell 31 [ . It can be seen as a 



principled generalisation of the sampling approach suggested in [29] from a single
MDP sample to multiple samples from the posterior. The crucial difference is 
that, while previous work uses some form of optimistic policy, we instead employ 
a more conservative policy in each stationary interval. This can be significantly 
better than the policy which is optimal for the expected MDP. 

The first problem we tackle is how to compute this policy given a belief 
over a finite number of MDPs. For this, we provide a simple algorithm based on 
backwards induction [see [III, for example]. In order to extend this approach to an 
arbitrary MDP set, we employ Monte Carlo sampling from the current posterior. 
Unlike other Bayesian sampling approaches Ull [H, S HH HH > we use these 
samples to estimate a policy that is nearly optimal (within the restricted set of 
memoryless policies) with respect to the distribution these samples were drawn 
from. Finally, we provide theoretical and experimental analyses of the proposed 
algorithms. 

2 MMBI: Multi-MDP Backwards Induction

Even when our belief $\xi$ is a probability measure over a finite set of MDPs $\mathcal{M}$,
finding an optimal policy is intractable. For that reason, we restrict ourselves to
memoryless policies $\pi \in \mathcal{P}_1$. We can approximate the optimal memoryless policy
with respect to $\xi$ by setting the posterior measure, given knowledge of the policy
so far and the current state, equal to the initial belief, i.e. $\xi(\mu \mid s_t = s, \pi) = \xi(\mu)$
(we do not condition on the complete history, since the policies are memoryless).
The approximation is in practice quite good, since the difference between the
two measures tends to be small. The policy $\pi_{\mathrm{MMBI}}$ can then be obtained via the
following backwards induction. By definition:

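$$ Q^\pi_{\xi,t}(s,a) = E_{\xi,\pi}(r_t \mid s_t = s, a_t = a) + \gamma E_{\xi,\pi}(U_{t+1} \mid s_t = s, a_t = a), \qquad (2.1) $$

where the expected reward term can be written as

$$ E_{\xi,\pi}(r_t \mid s_t = s, a_t = a) = \int_{\mathcal{M}} E_\mu(r_t \mid s_t = s, a_t = a) \, d\xi(\mu), \qquad (2.2a) $$

$$ E_\mu(r_t \mid s_t = s, a_t = a) = \int_{-\infty}^{\infty} r \, d\mathcal{R}_\mu^{s,a}(r). \qquad (2.2b) $$

The next-step utility can be written as:

$$ E_{\xi,\pi}(U_{t+1} \mid s_t = s, a_t = a) = \int_{\mathcal{M}} E_{\mu,\pi}(U_{t+1} \mid s_t = s, a_t = a) \, d\xi(\mu), \qquad (2.3a) $$

$$ E_{\mu,\pi}(U_{t+1} \mid s_t = s, a_t = a) = \int_{\mathcal{S}} V^\pi_{\mu,t+1}(s') \, d\mathcal{T}_\mu^{s,a}(s'). \qquad (2.3b) $$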

Putting those steps together, we obtain Algorithm 1, which greedily calculates
a memoryless policy for a $T$-stage problem and returns its expected utility.

Algorithm 1 MMBI: Backwards induction on multiple MDPs.
 1: procedure MMBI($\mathcal{M}$, $\xi$, $\gamma$, $T$)
 2:   Set $V_{\mu,T+1}(s) = 0$ for all $s \in \mathcal{S}$.
 3:   for $t = T, T-1, \ldots, 0$ do
 4:     for $s \in \mathcal{S}$, $a \in \mathcal{A}$ do
 5:       Calculate $Q_{\xi,t}(s,a)$ from (2.1) using $\{V_{\mu,t+1}\}$.
 6:     end for
 7:     for $s \in \mathcal{S}$ do
 8:       $a^*_{\xi,t}(s) = \arg\max \{Q_{\xi,t}(s,a) \mid a \in \mathcal{A}\}$.
 9:       for $\mu \in \mathcal{M}$ do
10:         $V_{\mu,t}(s) = Q_{\mu,t}(s, a^*_{\xi,t}(s))$.
11:       end for
12:     end for
13:   end for
14: end procedure

The calculation is greedy, since optimising over $\pi$ implies that at any
step $t + k$, we must condition the belief on past policy steps,
$\xi(\mu \mid s_{t+k} = s, \pi_t, \ldots, \pi_{t+k-1})$, to calculate the expected utility correctly.
Thus, the optimal $\pi_{t+k}$ depends on both future and past selections. Nevertheless, it
is easy to see that Alg. 1 returns the correct expected utility for time
step $t$. Theorem 1 bounds the gap between this and the Bayes-optimal
value function when the difference between the current and future beliefs is
small.
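
The backwards induction described above can be sketched in a few lines of Python for a finite set of MDPs. This is a minimal illustration rather than the author's implementation: the nested-list MDP representation (`T[s][a][s2]`, `R[s][a]` as expected reward), the function name `mmbi` and the tie-breaking rule (lowest action index) are all assumptions.

```python
def mmbi(mdps, belief, gamma, horizon):
    """Multi-MDP backwards induction for a finite set of MDPs.

    mdps: list of (T, R) pairs with T[s][a][s2] and R[s][a] (expected reward).
    belief: list of probabilities xi(mu), one per MDP.
    Returns (policy, v_xi), where policy[t][s] is the greedy action at stage t
    and v_xi[s] is the belief-averaged value at stage 0.
    """
    n_states = len(mdps[0][0])
    n_actions = len(mdps[0][0][0])
    # V[m][s] holds V_{mu,t+1}(s); it is zero beyond the final stage.
    V = [[0.0] * n_states for _ in mdps]
    policy = []
    for _ in range(horizon + 1):  # stages t = T, T-1, ..., 0
        # Per-MDP action values under the continuation fixed so far.
        Q = [
            [
                [
                    R[s][a] + gamma * sum(p * V[m][s2] for s2, p in enumerate(T[s][a]))
                    for a in range(n_actions)
                ]
                for s in range(n_states)
            ]
            for m, (T, R) in enumerate(mdps)
        ]
        greedy = []
        for s in range(n_states):
            # Belief-averaged action values, as in (2.1).
            q_xi = [
                sum(w * Q[m][s][a] for m, w in enumerate(belief))
                for a in range(n_actions)
            ]
            greedy.append(max(range(n_actions), key=q_xi.__getitem__))
        # Each MDP keeps its own value for the jointly chosen action.
        V = [[Q[m][s][greedy[s]] for s in range(n_states)] for m in range(len(mdps))]
        policy.append(greedy)
    policy.reverse()
    v_xi = [sum(w * V[m][s] for m, w in enumerate(belief)) for s in range(n_states)]
    return policy, v_xi


# One state, two actions; the two MDPs disagree on which action pays.
mdp_a = ([[[1.0], [1.0]]], [[1.0, 0.0]])
mdp_b = ([[[1.0], [1.0]]], [[0.0, 0.6]])
policy, v_xi = mmbi([mdp_a, mdp_b], [0.5, 0.5], 0.5, 1)
```

Keeping one value table per MDP, rather than a single averaged table, is what distinguishes this from planning in the expected MDP.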

Fig. 1. Value function bounds.

Theorem 1. For any $k \in [t, T]$, let $\xi_k \triangleq \xi(\cdot \mid s^k, a^{k-1})$ be the posterior after $k$
observations. Let $\lambda$ be a dominating measure on $\mathcal{M}$ and $\|f\|_{\lambda,1} \triangleq \int_{\mathcal{M}} |f(\mu)| \, d\lambda(\mu)$
for any $\lambda$-measurable function $f$. If $\|\xi_t - \xi_k\|_{\lambda,1} \le \epsilon$ for all $k$, then the policy
$\pi_{\mathrm{MMBI}}$ found by MMBI is within $\epsilon \, r_{\max} (1-\gamma)^{-2}$ of the Bayes-optimal policy $\pi^*_\xi$.

Proof. The error at every stage $k > t$ is bounded as follows:

$$ \left| V_{\xi,k}(s) - E_{\xi}(U_k \mid s^k, a^k) \right| = \left| \int_{\mathcal{M}} [\xi_t(\mu) - \xi_k(\mu)] V_{\mu,k}(s) \, d\lambda(\mu) \right| \le \frac{r_{\max}}{1-\gamma} \int_{\mathcal{M}} |\xi_t(\mu) - \xi_k(\mu)| \, d\lambda(\mu) \le \frac{\epsilon \, r_{\max}}{1-\gamma}. $$

The final result is obtained via the geometric series. □

We can similarly bound the gap between the MMBI policy and the $\xi$-optimal
memoryless policy, by bounding $\sup_{k,s,\pi} \|\xi_t(\cdot) - \xi(\cdot \mid s_k = s, \pi)\|_{\lambda,1}$.

The $\xi$-optimal memoryless policy is generally different from the policy which
is optimal with respect to the expected MDP $\hat\mu_\xi \triangleq E_\xi \mu$, as can be seen via
counterexample where ^ V? , or even where E^ [i ^ A4 . MMBI can be 

used to obtain a much tighter value function bound than the $\hat\mu_\xi$-optimal policy,
as shown in Fig. 1, where the MMBI bound is compared to the $\hat\mu_\xi$-optimal policy
bound and the simple upper bound $V^*_\xi(s) \le E_\xi \max_\pi V^\pi_\mu(s)$. The figure shows
how the bounds change as our belief over 8 MDPs changes. When we are more
uncertain, the MMBI bound is much tighter than the $\hat\mu_\xi$-optimal one. However, when most of the
probability mass is around a single MDP, both lower bounds coincide. In further
experiments on online reinforcement learning, described in Sec. 3, near-optimal
memoryless policies are compared against the $\hat\mu_\xi$-optimal policy.
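
The simple upper bound mentioned above averages, over the belief, the value each MDP would attain under its own optimal policy. A minimal sketch for a finite set of MDPs follows; the nested-list representation (`T[s][a][s2]`, `R[s][a]`) and the function names are assumptions made for illustration.

```python
def optimal_value(T, R, gamma, horizon):
    """Finite-horizon optimal value V_mu^*(s) by backwards induction."""
    n_states, n_actions = len(T), len(T[0])
    V = [0.0] * n_states
    for _ in range(horizon + 1):  # stages T, T-1, ..., 0
        V = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(T[s][a]))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
    return V


def upper_bound(mdps, belief, gamma, horizon):
    """Belief average of per-MDP optimal values: E_xi max_pi V_mu^pi(s)."""
    values = [optimal_value(T, R, gamma, horizon) for T, R in mdps]
    return [
        sum(w * v[s] for w, v in zip(belief, values))
        for s in range(len(values[0]))
    ]


# One state, two actions; each MDP prefers a different action.
mdp_a = ([[[1.0], [1.0]]], [[1.0, 0.0]])
mdp_b = ([[[1.0], [1.0]]], [[0.0, 0.6]])
ub = upper_bound([mdp_a, mdp_b], [0.5, 0.5], 0.5, 1)
```

In this toy case the upper bound is 1.2, while always playing the first action is worth 0.75 under the same belief, so the gap between the bounds reflects how much the MDPs disagree.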



2.1 Computational complexity 

When $\mathcal{M}$ is finite and $T < \infty$, MMBI (Alg. 1) returns a greedily-optimised
policy $\pi_{\mathrm{MMBI}}$ and its value function. When $T \to \infty$, MMBI can be used to
calculate an $\epsilon$-optimal approximation by truncating the horizon, as shown below.

Lemma 1. The complexity of Alg. 1 for bounding the value function error by $\epsilon$
is $O\left(\left[|\mathcal{M}||\mathcal{S}|^2(|\mathcal{A}|+1) + (1+|\mathcal{M}|)|\mathcal{S}||\mathcal{A}|\right] \log_\gamma \frac{\epsilon(1-\gamma)}{r_{\max}}\right)$, assuming $r_t \in [0, r_{\max}]$.

Proof. Since $r_t \in [0, r_{\max}]$, if we look up to some horizon $T$, our value function
error is bounded by $\gamma^T c$, where $c = H r_{\max}$ and $H = \frac{1}{1-\gamma}$ is the effective horizon.
Consequently, we need $T \ge \log_\gamma(\epsilon/c)$ to bound the error by $\epsilon$. For each $t$, step 5
is performed $|\mathcal{S}||\mathcal{A}|$ times. Each step takes $O(|\mathcal{M}|)$ operations for the expected
reward and $O(|\mathcal{S}||\mathcal{M}|)$ operations for the next-step expected utility. The second
loop is $O(|\mathcal{S}|(|\mathcal{A}| + |\mathcal{M}||\mathcal{S}|))$, since it is performed $|\mathcal{S}|$ times, with the max
operators taking $|\mathcal{A}|$ operations, while the inner loop is performed $|\mathcal{M}|$ times, with each
local MDP update (step 10) taking $|\mathcal{S}|$ operations. □
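
To make the quantities in the proof concrete, the sketch below computes the truncation horizon $T \ge \log_\gamma(\epsilon/c)$ and the per-stage operation count obtained by adding up the two loops. The function names are hypothetical, and the code assumes $0 < \gamma < 1$.

```python
import math


def truncation_horizon(epsilon, gamma, r_max):
    """Smallest integer T with gamma**T * c <= epsilon, c = r_max / (1 - gamma)."""
    c = r_max / (1.0 - gamma)
    if epsilon >= c:
        return 0  # the trivial bound c is already small enough
    return math.ceil(math.log(epsilon / c) / math.log(gamma))


def operations_per_stage(n_mdps, n_states, n_actions):
    """Operation count for one stage t, following the two loops in the proof."""
    first_loop = n_states * n_actions * (n_mdps + n_states * n_mdps)
    second_loop = n_states * (n_actions + n_mdps * n_states)
    return first_loop + second_loop


T = truncation_horizon(0.1, 0.95, 1.0)
ops = operations_per_stage(8, 5, 2)
```

The per-stage count agrees with the bracketed term in Lemma 1, and multiplying it by the horizon gives the overall operation count up to constants.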



Algorithm 2 MSBI: Multi-Sample Backwards Induction
1: procedure MSBI($\xi$, $\gamma$, $\epsilon$)

3:   $\mathcal{M} = \{\mu_1, \ldots, \mu_n\}$, $\mu_i \sim \xi$.

4:   MMBI($\mathcal{M}$, $p$, $\gamma$, $\log_\gamma$ …) with $p(\mu_i) = 1/n$ for all $i$.

5: end procedure



It is easy to see that the most significant term is $O(|\mathcal{M}||\mathcal{S}|^2|\mathcal{A}|)$, so the algorithmic
complexity scales linearly with the number of MDPs. Consequently, when $\mathcal{M}$
is not finite, exact computation is not possible. However, we can use high
probability bounds to bound the expected loss of a policy calculated stochastically
through MSBI (Alg. 2).

MSBI simply takes a sufficient number of samples of MDPs from $\xi$, so that
in $\xi$-expectation, the loss relative to the MMBI policy is bounded according to
the following lemma.

Lemma 2. The expected loss of MSBI relative to MMBI is bounded by $\epsilon$.

Proof. Let $\hat{E}_n U \triangleq \frac{1}{n} \sum_{i=1}^{n} E_{\mu_i} U$ denote the empirical expected utility over the
sample of $n$ MDPs, where the policy subscript $\pi$ is omitted for simplicity. Since
$E_\xi \hat{E}_n U = E_\xi U$, we can use the Hoeffding inequality to obtain:

$$ \xi\left(\hat{E}_n U \ge E_\xi U + \epsilon\right) \le e^{-2n\epsilon^2/c^2}. $$

This implies the following bound, where the second step takes $\delta = (8n)^{-1/3}$:

$$ E_\xi(\hat{E}_n U - E_\xi U) \le c\delta + c\sqrt{\frac{\ln(1/\delta)}{2n}} \le c(8n)^{-1/3} + c\sqrt{\frac{\ln (8n)^{1/3}}{2n}} \le 3cn^{-1/3}. $$

Let $\mathcal{P}_1$ be the set of memoryless policies. Since the bound holds uniformly (for
any $\pi \in \mathcal{P}$), the policy $\pi^* \in \mathcal{P}_1$ maximising $\hat{E}_n U$ is within $3cn^{-1/3}$ of the $\xi$-optimal
policy in $\mathcal{P}_1$. □

Finally, we can combine the above results to bound the complexity of achieving
a small approximation error for MSBI, with respect to expected loss:

Theorem 2. MSBI (Alg. 2) requires $O\left(\left(\frac{6 r_{\max}}{\epsilon(1-\gamma)}\right)^3 |\mathcal{S}|^2 |\mathcal{A}| \log_\gamma \frac{\epsilon(1-\gamma)}{2 r_{\max}}\right)$ operations
to be $\epsilon$-close to the best MMBI policy.

Proof. From Lem. 2 we can set $n = (6c/\epsilon)^3$ to bound the regret by $\epsilon/2$. Using
the same value in Lem. 1 and setting $|\mathcal{M}| = n$, we obtain the required result. □
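
The sample size used in the proof grows quickly as $\epsilon$ shrinks or $\gamma$ approaches 1, which the following sketch makes easy to check. It takes $n = (6c/\epsilon)^3$ with $c = r_{\max}/(1-\gamma)$ and the $3cn^{-1/3}$ loss bound from the proof of Lemma 2; the function names are hypothetical.

```python
import math


def msbi_sample_size(epsilon, gamma, r_max):
    """Number of sampled MDPs n = (6c / epsilon)^3, rounded up."""
    c = r_max / (1.0 - gamma)
    return math.ceil((6.0 * c / epsilon) ** 3)


def sampling_loss_bound(n, gamma, r_max):
    """Expected-loss bound 3 c n^(-1/3) for n sampled MDPs."""
    c = r_max / (1.0 - gamma)
    return 3.0 * c * n ** (-1.0 / 3.0)


n = msbi_sample_size(6.0, 0.5, 1.0)
loss = sampling_loss_bound(n, 0.5, 1.0)
```

With these toy values the loss bound equals $\epsilon/2$, as the proof requires. Trying a discount factor of 0.95 and $\epsilon = 1$ instead shows why choosing $n$ from $\epsilon$ is rarely practical.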



2.2 Application to robust Bayesian reinforcement learning 

While MSBI can be used to obtain a memoryless policy which is in expectation 
close to both the optimal memoryless policy and the Bayes-optimal policy for a 
given belief, the question is how to extend the procedure to online reinforcement 
learning. The simplest possible approach is to simply recalculate the stationary 
policy after some interval B > 0. This is the approach followed by MCBRL 
(Alg. [3]), shown below. 



Algorithm 3 MCBRL: Monte-Carlo Bayesian Reinforcement Learning
1: procedure MCBRL($\xi_0$, $\gamma$, $\epsilon$, $B$)
2:   Calculate $\xi_t(\cdot) = \xi_0(\cdot \mid s^t, a^{t-1})$.
3:   Call MSBI($\xi_t$, $\gamma$, $\epsilon$) and run returned policy for $B$ steps.
4: end procedure



3 Experiments in reinforcement learning problems 

Selecting the number of samples $n$ according to $\epsilon$ for MCBRL is computationally
prohibitive. In practice, instead of setting $n$ via $\epsilon$, we simply consider increasing
values of $n$. For a single sample ($n = 1$), MCBRL is equivalent to the sampling
method in [29], which at every new stage samples a single MDP from the current
posterior and then uses the policy that is optimal for the sampled MDP.
In addition, for this particular experiment, rather than using the memoryless
policy found, we apply the stationary policy derived by using the first step of
the memoryless policy. This incurs a small additional loss. We also compared
MCBRL against the common heuristic of acting according to the policy that is
optimal with respect to the expected MDP $\hat\mu_\xi \triangleq E_\xi \mu$. The algorithm, referred
to as the Exploit heuristic in [23], is shown in detail in Alg. 4. At every step,
this calculates the expected MDP by obtaining the expected transition kernel
and reward function under the current belief. It then acts according to the
optimal policy with respect to $\hat\mu_\xi$. This policy may be much worse than the optimal
policy, even within the class of stationary policies $\mathcal{P}_1$.

Algorithm 4 Exploit: Expected MDP exploitation [23]
1: procedure Exploit($\xi_0$, $\gamma$)
2:   for $t = 1, \ldots$ do
3:     Calculate $\xi_t(\cdot) = \xi_0(\cdot \mid s^t, a^{t-1})$.
4:     Estimate $\hat\mu_{\xi_t} \triangleq E_{\xi_t} \mu$.
5:     Calculate $Q^*_{\hat\mu_{\xi_t}}(s, a)$ using discount parameter $\gamma$.
6:     Select $a_t = \arg\max_a Q^*_{\hat\mu_{\xi_t}}(s_t, a)$.
7:   end for
8: end procedure
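
One step of the Exploit heuristic can be sketched as follows, under the assumption that the current belief is represented by a finite list of MDPs with weights (for instance posterior samples). The nested-list representation (`T[s][a][s2]`, `R[s][a]`), the function names and the fixed number of value-iteration sweeps are illustrative choices, not taken from [23].

```python
def expected_mdp(mdps, belief):
    """Belief-weighted average of transition kernels and expected rewards."""
    n_states, n_actions = len(mdps[0][0]), len(mdps[0][0][0])
    T = [
        [
            [sum(w * m[0][s][a][s2] for w, m in zip(belief, mdps))
             for s2 in range(n_states)]
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]
    R = [
        [sum(w * m[1][s][a] for w, m in zip(belief, mdps)) for a in range(n_actions)]
        for s in range(n_states)
    ]
    return T, R


def exploit_action(mdps, belief, gamma, state, sweeps=500):
    """Greedy action in `state` for the expected MDP, via value iteration.

    Assumes gamma < 1 so that the sweeps converge.
    """
    T, R = expected_mdp(mdps, belief)
    n_states, n_actions = len(T), len(T[0])

    def q(s, a, V):
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(T[s][a]))

    V = [0.0] * n_states
    for _ in range(sweeps):
        V = [max(q(s, a, V) for a in range(n_actions)) for s in range(n_states)]
    return max(range(n_actions), key=lambda a: q(state, a, V))


# One state, two actions; the two MDPs disagree on which action pays.
mdp_a = ([[[1.0], [1.0]]], [[1.0, 0.0]])
mdp_b = ([[[1.0], [1.0]]], [[0.0, 0.6]])
T_hat, R_hat = expected_mdp([mdp_a, mdp_b], [0.5, 0.5])
action = exploit_action([mdp_a, mdp_b], [0.5, 0.5], 0.9, state=0)
```

Because only the averaged model is kept, this procedure discards the spread of the belief, which is the information a multi-MDP method retains.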




[Figure 2. Panel (a): Expected regret estimate. Panel (b): Empirical performance distribution; horizontal axis: total reward $\times 10^{-3}$.]

Fig. 2. Performance on the chain task, for the first $10^3$ steps, over $10^4$ runs. (a)
Expected regret relative to the optimal (oracle) policy. The sampling curve shows
the regret of Alg. 3 as the number of samples increases, with 95% confidence
interval calculated via a $10^4$-bootstrap. The expected curve shows the performance
of an algorithm acting greedily with respect to the expected MDP. (b) Empirical
distribution of total rewards for the expected MDP approach and MCBRL with
$n = 1$ and $n = 8$ samples.



We compared the algorithms on the Chain task [9], commonly used to evaluate
exploration in reinforcement learning problems. Traditionally, the task has
a horizon of $10^3$ steps, a discount factor $\gamma = 0.95$, and the expected total reward
$E_{\mu,\pi} \sum_{t=1}^{T} r_t$ is compared. We also report the expected utility $E_{\mu,\pi} U_t$, which
depends on the discount factor. All quantities are estimated over $10^4$ runs with
appropriately seeded random number generators to reduce variance.² The initial
belief about the state transition distribution was set to be a product-Dirichlet
prior [see [IH with all parameters equal to $|\mathcal{S}|^{-1}$, while a product-Beta prior with
parameters $(1,1)$ was used for the rewards.

2 In both cases this expectation is with respect to the distribution induced by the
actual MDP $\mu$ and policy $\pi$ followed, rather than with respect to the belief $\xi$.
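
Sampling an MDP from a belief of this form can be sketched with the Python standard library alone. The sketch assumes Bernoulli rewards (so that the Beta prior is conjugate) and that the posterior is summarised by observation counts; the count layout and the name `sample_mdp` are hypothetical.

```python
import random


def sample_mdp(trans_counts, reward_counts, rng):
    """Draw one MDP (T, R) from a product-Dirichlet / product-Beta posterior.

    trans_counts[s][a][s2]: number of observed transitions s, a -> s2.
    reward_counts[s][a]: (ones, zeros) observed for an assumed Bernoulli reward.
    """
    n_states = len(trans_counts)
    prior = 1.0 / n_states  # Dirichlet parameters equal to 1/|S|
    T, R = [], []
    for s in range(n_states):
        T_row, R_row = [], []
        for a in range(len(trans_counts[s])):
            # A Dirichlet draw is a vector of Gamma draws, normalised.
            g = [rng.gammavariate(prior + c, 1.0) for c in trans_counts[s][a]]
            total = sum(g)
            T_row.append([x / total for x in g])
            ones, zeros = reward_counts[s][a]
            # Beta(1, 1) prior updated with the Bernoulli counts.
            R_row.append(rng.betavariate(1.0 + ones, 1.0 + zeros))
        T.append(T_row)
        R.append(R_row)
    return T, R


rng = random.Random(0)
counts = [[[3, 0], [0, 1]], [[1, 1], [0, 0]]]
reward_obs = [[(2, 1), (0, 0)], [(0, 3), (1, 1)]]
samples = [sample_mdp(counts, reward_obs, rng) for _ in range(8)]
```

Each sampled pair can then be given weight $1/n$ and passed to a multi-MDP planner.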

Figure 2 summarises the results in terms of total reward. The left hand
side (Fig. 2(a)) shows the expected difference in total reward between the optimal
policy $\pi^*$ and the used policy $\pi$, over $T$ steps, i.e. the regret: $\mathcal{L} = E_{\mu,\pi^*} \sum_{t=1}^{T} r_t - E_{\mu,\pi} \sum_{t=1}^{T} r_t$.
The error bars denote 95% confidence intervals obtained via a
$10^4$-bootstrap [lij]. For $n = 1$, MCBRL performs worse than the expected MDP
approach, in terms of total reward. On the other hand, as the number of samples
increases, its performance monotonically improves.
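
A percentile bootstrap of the kind used for the error bars can be sketched as below. The resampling scheme, the default of 1000 resamples (the experiments use $10^4$) and the function names are assumptions for illustration.

```python
import random


def bootstrap_interval(samples, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(samples, k=n)) / n for _ in range(n_boot))
    lower = means[int((1.0 - level) / 2.0 * n_boot)]
    upper = means[min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))]
    return lower, upper


def regret_estimate(oracle_totals, policy_totals):
    """Difference between mean total reward of the oracle and of the policy."""
    return (sum(oracle_totals) / len(oracle_totals)
            - sum(policy_totals) / len(policy_totals))


constant_interval = bootstrap_interval([2.0] * 10)
varied_interval = bootstrap_interval([1.0, 2.0, 3.0, 4.0, 5.0])
regret = regret_estimate([10.0, 12.0], [8.0, 10.0])
```

The interval describes the accuracy of the mean estimate, not the spread of individual runs.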



Some more detail on the behaviour of the algorithms is given in Figure 2(b),
which shows the empirical performance distribution in terms of total reward.
The expected MDP approach has a high probability of getting stuck in a
sub-optimal regime. On the contrary, MCBRL, for $n = 1$, results in significant
over-exploration of the environment. However, as $n$ increases, MCBRL explores
significantly less, while the number of runs where we are stuck in the sub-optimal
regime remains small (< 1% of the runs).

Table 1 presents comparative results on the chain task for Alg. 4 and for MCBRL
for $n \in \{1, 8, 16\}$ in terms of the total reward received in $10^3$ steps. This enables
us to compare against the results reported in [23], [17]. While the performance of
Alg. 4 may seem surprisingly good, it is actually in line with the results reported
in [23]. Therein, Beetle only outperformed Alg. 4 in the Chain task when stronger
priors were used. In addition, we would like to note that while the case $n = 1$ is
worse than Alg. 4 for the total reward metric, this no longer holds when we examine
the expected utility, where an improvement can already be seen for $n = 1$.

Model            $E \sum_{t=1}^{10^3} r_t$ ($E\,U$)   80% percentile interval   Confidence interval
Alg. 4           3287 (26.64)                         2518 - 3842               3275 - 3299
MCBRL, n = 1     3166 (28.50)                         2748 - 3582               3159 - 3173
MCBRL, n = 8     3358 (29.65)                         2932 - 3800               3350 - 3366
MCBRL, n = 16    3376 (29.95)                         2946 - 3830               3368 - 3384

Model            $E \sum_{t=1}^{10^3} r_t$            Standard interval
Beetle [23]      1754                                 1712 - 1796
AMP-EM [17]      2180                                 2108 - 2254
SEM [17]         2052                                 2000 - 2111

Table 1. Comparative results on the chain task. The 80% percentile interval is
such that no more than 10% of the runs were above the maximum or below the
minimum value. The confidence interval on the accuracy of the mean estimate
is the 95% bootstrap interval. The results for Beetle and the EM algorithms
were obtained from the cited papers, with the interval based on the reported
standard deviation.

4 Discussion

We introduced MMBI, a simple backwards induction procedure, to obtain a
near-optimal memoryless policy with respect to a belief over a finite number
of MDPs. This was generalised to MSBI, a stochastic procedure, whose loss is
close in expectation to MMBI, with a gap that depends polynomially on the
number of samples, for a belief on an arbitrary set of MDPs. It is shown that MMBI
results in a much tighter lower bound on the value function than the value of
the $\hat\mu_\xi$-optimal policy. In addition, we prove a bound on the gap between the
value of the MMBI policy and the Bayes-optimal policy. Our results are then
applied to reinforcement learning problems, by using the MCBRL algorithm to
sample a number of MDPs at regular intervals. This can be seen as a principled
generalisation of [29], which only draws one sample at each such interval. Then
MSBI is used to calculate a near-optimal memoryless policy within each interval.
We show experimentally that this performs significantly better than following
the $\hat\mu_\xi$-optimal policy. It is also shown that the performance increases as we make
the bound tighter by increasing the number of samples taken.

Compared to results reported for other Bayesian reinforcement learning
approaches on the Chain task, this rather simple method performs surprisingly
well. This can be attributed to the fact that at each stage, the algorithm selects
actions according to a nearly-optimal stationary policy.

In addition, MSBI itself could be particularly useful for inverse reinforcement
learning problems (see for example JH, [22]) where the underlying dynamics are
unknown, or for multi-task problems [26]. Then it would be possible to obtain good
stationary policies that take into account the uncertainty over the dynamics,
which should be better than using the expected MDP heuristic. More specifically,
in future work, MMBI will be used to generalise the Bayesian methods developed
in [U, [H for the case of unknown dynamics.



In terms of direct application to reinforcement learning, MSBI could be
used in the inner loop of some more sophisticated method than MCBRL. For
example, it could be employed to obtain tight lower bounds for the leaf nodes
of a planning tree such as [12]. By tight integration with such methods, we hope
to obtain improved performance, since we would be considering wider policy
classes. In a related direction, it would be interesting to examine better
upper bounds @, 0, 0] and in particular whether the information relaxations
discussed by Brown et al. @ could be extended to the Bayes-optimal case.